November 10, 2023

Setup and Code Availability

Problem Set-Up

Consider a simple study of the microbiome pre/post antibiotic administration.

  • Research question: Which taxa change in absolute abundance after taking an antibiotic?
  • How do we analyze this data to answer the question?

Sequence Count Data as a Sample

Gap between Data and System

Implications for Analysis

Options for Analysis

  1. Use an existing method for differential abundance.
    • ALDEx2, DESeq2, edgeR, etc.
    • These methods rely on normalizations.
  2. Reframe the question (CoDA).
    • Data are compositional.
    • Differential abundance cannot be done rigorously.

Problems with these Options

  • Option 1: Disconnect between question and normalization can lead to elevated type-I and type-II errors. Why?

  • Option 2: Can’t answer many research questions.

New Option: Scale Reliant Inference

A middle ground approach:

  • \(Y\) is a measurement of the underlying system \(W\).

  • Desired quantity depends on \(W\) (i.e., \(\theta = f(W)\)). However, \(W\) depends on both the composition and system scale, i.e.:

\[W_{dn} = W_{dn}^\parallel W_n^\perp\] \[W_n^\perp = \sum_{d=1}^D W_{dn}\]

Scale Simulation Random Variables

  1. Draw samples of \(W^{\parallel}\) from a measurement model (can depend on \(Y\)).
  2. Draw samples of \(W^{\perp}\) from a scale model (can depend on \(W^{\parallel}\)).
  3. Calculate samples of \(\theta = f(W^\parallel, W^\perp)\).

Adding Scale to ALDEx2

ALDEx2 Software Suite

ALDEx2 Software Suite

Unacknowledged Bias within ALDEx2

Step 2 places an assumption on system scale.

  • Consider that \(\log W_{dn} = \log W_{dn}^\parallel + \log W_n^\perp\).

  • We can show that the CLR transformation corresponds to the assumption:

\[\log W_n^\perp = - \mathrm{mean}(\log \hat{W}_{\cdot n}^\parallel).\]

Moving Past Normalizations to Scale within ALDEx2

Scale Models in ALDEx2

Normalizations are replaced by a scale model:

\[\log W_n^\perp \sim Q\]

There are no restrictions on \(Q\), although there are some helpful options:

  1. Based on normalizations.
  2. Based on biological knowledge.
  3. Based on outside measurements.

New Additions to ALDEx2

Data Simulation

We have:

  • 2 conditions (pre- and post- administration) with 50 samples per condition.
  • 20 taxa: 4 change between conditions, 16 don’t.

To simulate the data, we:

  1. Use the Poisson distribution to sample true abundances (\(W\)).
  2. Use Multinomial sampling to sub-sample \(W\) to an arbitrary sequencing depth (\(Y\)).

Simulation functions can be found in scripts > data_simulation.R.

Reading in the Simulated Data

###Analysis file for simulated data
library(ALDEx2)
library(tidyverse)
library(ggplot2)
library(ggpattern)
library(cowplot)

set.seed(12345)
setwd('..')
##Reading in data (see "data_simulation.R" for details.)
rdat <- read.csv(file.path("data/simulation/sim_seq_dat.csv"))
flow_data <- read.csv(file.path("data/simulation/sim_flow_dat.csv"))

Inspecting the Data

## "Y" represents the OTU table
Y <- t(rdat[,-1])
Y[1:4,1:4]
##       [,1] [,2] [,3] [,4]
## Taxa1  657  672  652  676
## Taxa2  669  697  634  652
## Taxa3  663  675  660  675
## Taxa4  666  668  633  670
##Vector denoting whether samples was in
##pre- or post- antibiotic condition.
conds <- as.character(rdat[,1])
conds[c(1:4,51:55)]
## [1] "Pre"  "Pre"  "Pre"  "Pre"  "Post" "Post" "Post" "Post" "Post"

Effects of Unacknowledged Bias

## Fitting and analyzing the original ALDEx2
## model
mod.base <- aldex(Y, conds)
print(row.names(mod.base %>%
    filter(we.eBH < 0.05)))

[1] “Taxa1” “Taxa2” “Taxa3” “Taxa4” “Taxa5” “Taxa6” “Taxa7” “Taxa8” [9] “Taxa9” “Taxa10” “Taxa11” “Taxa12” “Taxa13” “Taxa14” “Taxa15” “Taxa16” [17] “Taxa17” “Taxa18” “Taxa19” “Taxa20”

Including scale through gamma.

  • The argument gamma has been added to the aldex and aldex.clr function.

  • gamma can either be a single numeric or a matrix.

    1. Single numeric: controls the noise on the default scale model.
    2. Matrix: A \(N \times S\) matrix of samples of \(W\).

Option 1: Default Scale Model

The default scale model is based on errors in the CLR normalization.

\[\log \hat{W}_{n}^{\perp(s)} = - \mathrm{mean} \left(\log \hat{W}^{\parallel (s)}_{\cdot n}\right) + \Lambda^\perp x_{n}\] \[\Lambda^\perp \sim \ N(0, \gamma^2).\]

Decoding the Default Scale Model

If \(\gamma \rightarrow 0\), the default scale model is equal to the CLR, i.e.,

\[\log \hat{W}_{n}^{\perp(s)} = - \mathrm{mean} \left(\log \hat{W}^{\parallel (s)}_{\cdot n}\right).\]

Thus, the noise component controls deviations from this assumption.

Decoding the Default Scale Model

\[\log \hat{W}_{n}^{\perp(s)} = - \mathrm{mean} \left(\log \hat{W}^{\parallel (s)}_{\cdot n}\right) + \Lambda^\perp x_{n}\] \[\Lambda^\perp \sim \ N(0, \gamma^2).\]

  • By the empirical rule, 95% of samples from \(\Lambda^\perp\) fall within \(2 \gamma\) of zero.
  • This implies that, with 95% certainty, the scales between conditions fall within the range (\(2^{\hat{\theta}^\perp-2 \gamma}, 2^{\hat{\theta}^\perp+2 \gamma}\)) where \(\hat{\theta}^\perp\) is what is implied by the CLR.

Re-creating the CLR with gamma.

We can recreate the CLR by letting \(\gamma \rightarrow 0\).

mod.clr <- aldex(Y, conds, gamma = 1e-3)
plot(mod.base$effect, mod.clr$effect, xlab = "Original ALDEx2 
  Effect Size", ylab = "CLR Scale Model Effect Size")
abline(a=0,b=1, col = "red", lty = "dashed")

How to Choose \(\gamma\)

We recommend a value of \(\gamma = 0.50\).

  • This corresponds the the assumption that 95% of the scale samples are within \((0.5, 2)\) of the geometric mean.
  • In words, 95% of the scale samples are within half or double of the amount implied by the normalization.
  • Any value of \(\gamma > 0\) is an improvement over the status quo!!

Option 1: Examples (Some Noise)

## Adding noise via the default scale model
mod.ss <- aldex(Y, conds, gamma = .25)
mod.ss %>% filter(we.eBH < 0.05)
##          rab.all rab.win.Post rab.win.Pre  diff.btw  diff.win   effect
## Taxa4   1.746962     1.527557    2.207001 0.6853050 0.2715392 2.471074
## Taxa15 -1.444189    -1.824995   -1.064686 0.7788118 0.5294582 1.393110
## Taxa20 -1.926847    -2.786268   -1.091654 1.7237804 0.5934539 2.902773
##             overlap        we.ep       we.eBH        wi.ep       wi.eBH
## Taxa4  0.0053126904 1.767374e-08 1.767374e-07 1.168572e-07 1.168572e-06
## Taxa15 0.0531084217 5.146644e-06 3.431096e-05 2.457754e-05 1.638503e-04
## Taxa20 0.0003155441 2.274644e-24 4.548065e-23 7.454892e-18 8.059931e-17

3 of 20 taxa (all true positives) are significant!

Option 1: Examples (More Noise)

## Adding more noise via the default scale model
mod.ss.high <- aldex(Y, conds, gamma = 1)
mod.ss.high %>% filter(we.eBH < 0.05)
##         rab.all rab.win.Post rab.win.Pre diff.btw diff.win   effect    overlap
## Taxa20 -2.29671    -2.797038  -0.9882125 1.843279  1.12733 1.512371 0.04968752
##             we.ep    we.eBH      wi.ep     wi.eBH
## Taxa20 0.03204373 0.0404344 0.03330432 0.04114547

1 of 20 taxa (all true positives) are significant!

Option 2: Create Your Own Scale Matrix

Alternatively, can pass a matrix of scale samples to gamma so long as:

  1. The dimension is \(N \times S\).
  2. They are samples of \(W^\perp\) not \(\log W^\perp\).

Example 1: Based on Biology

  • Scale is guided by the biological system or the researcher’s prior beliefs.

  • For example, suppose we expect the antibiotic to decrease the total scale by 10%. This can be written in a scale model:

\[\log \hat{W}_n^{(s)} \sim N(1.0, \gamma^2) \, \, \mathrm{if} \, \, x_n = \mathrm{``pre"}\] \[\log \hat{W}_n^{(s)} \sim N(0.9, \gamma^2) \, \, \mathrm{if} \, \, x_n = \mathrm{``post"}.\]

Example 1: Based on Biology

  • The aldex.makeScaleMatrix function can help with this.

  • It will sample from a log Normal distribution given a mean vector for each condition.

  • For example, if there are only two samples, aldex.makeScaleMatrix(1, c(1,.9), conds <- c("pre", "post"), log = FALSE) will sample a 10% decrease in means between conditions.

  • Note: The absolute scale doesn’t matter. It is the relationship between conditions.

Example 1: Based on Biology

##Creating an informed model using biological reasoning
scales <- c(rep(1, 50), rep(0.9, 50))
scale_samps <- aldex.makeScaleMatrix(.15, scales, conds, log=FALSE)
mod.know <- aldex(Y, conds, gamma = scale_samps)
mod.know %>% filter(we.eBH < 0.05)
##          rab.all rab.win.Post rab.win.Pre diff.btw  diff.win   effect
## Taxa3  -2.598439    -2.847642   -2.365482 0.484869 0.3519212 1.311713
## Taxa4  -2.885418    -3.406038   -2.389054 1.011183 0.3536051 2.808902
## Taxa15 -6.156702    -6.758008   -5.667178 1.086641 0.5772437 1.857344
## Taxa20 -6.635045    -7.727209   -5.697487 2.049607 0.6622978 3.061246
##             overlap        we.ep       we.eBH        wi.ep       wi.eBH
## Taxa3  7.403938e-02 2.253685e-12 1.126842e-11 5.458474e-11 2.729237e-10
## Taxa4  9.385795e-04 7.615935e-34 7.701809e-33 8.480967e-18 8.543692e-17
## Taxa15 2.030620e-02 3.915641e-20 2.610427e-19 4.482331e-16 2.988221e-15
## Taxa20 2.193153e-05 2.897371e-31 2.898034e-30 6.973752e-18 8.360892e-17

Example 2: Outside Measurements

  • There has been a push to collect outside measurements (e.g., qPCR, flow cytometry).

  • These can be used in building a scale model if they are informative on the scale of interest.

  • Suppose we have access to flow cytometry measurements \((z_{1n}, z_{2n}, z_{3n})\). We could build a scale model of the form: \[\log \hat{W}^{\perp (s)}_{n} \sim N(\mathrm{mean}(z_{1n}, z_{2n}, z_{3n}),\text{var}(z_{1n}, z_{2n}, z_{3n})).\]

Example 2: Flow Cytometry Data

##Inspeciting our flow cytometry data
head(flow_data)
##   sample     flow
## 1      1 30212.92
## 2      1 29536.80
## 3      1 30727.71
## 4      2 30168.53
## 5      2 29740.54
## 6      2 29891.60

Example 2: Flow Cytometry Data

## Now creating an informed model using the flow data
flow_data_collapse <- flow_data %>%
    group_by(sample) %>%
    mutate(mean = mean(flow)) %>%
    mutate(stdev = sd(flow)) %>%
    dplyr::select(-flow) %>%
    ungroup() %>%
    unique()

scale_samps <- matrix(NA, nrow = nrow(flow_data_collapse), ncol = 128)
for (i in 1:nrow(scale_samps)) {
    scale_samps[i, ] <- rnorm(128, flow_data_collapse$mean[i],
        flow_data_collapse$stdev[i])
}

Example 2: Flow Cytometry Data

mod.flow <- aldex(Y, conds, gamma = scale_samps)
mod.flow %>%
    filter(we.eBH < 0.05)
##          rab.all rab.win.Post rab.win.Pre  diff.btw  diff.win   effect
## Taxa3  11.743813    11.538037   11.976700 0.4368715 0.1263678 3.415176
## Taxa4  11.493789    10.981428   11.945142 0.9695842 0.1348510 7.153833
## Taxa15  8.202953     7.627722    8.675126 1.0556408 0.4649931 2.202673
## Taxa20  7.750293     6.662892    8.648625 1.9977850 0.5450170 3.632701
##             overlap        we.ep       we.eBH        wi.ep       wi.eBH
## Taxa3  3.155441e-04 4.727398e-39 3.652625e-38 7.061281e-18 4.537357e-17
## Taxa4  2.193153e-05 7.049649e-59 1.409930e-57 6.856641e-18 4.517641e-17
## Taxa15 4.687716e-03 2.427717e-24 1.213859e-23 2.000151e-17 1.000870e-16
## Taxa20 2.193153e-05 2.110916e-32 1.369811e-31 6.859963e-18 4.517641e-17

Comparing Results between Scale Models

Sensitivity Analyses

  • How do we choose a value of \(\gamma\)?

  • Sensitivity analyses allow us to understand how an entity’s significance changes with varying amounts of scale noise.

  • Several built-in functions help with this: aldex.senAnalysis and plotGamma.

Sensitivity Analyses in ALDEx2

To run a sensitivity analysis in ALDEx2 using the default scale model, do the following:

  1. Specify a vector of values of \(\gamma\) to test.
  2. Run aldex.clr on the OTU table and conditions.
  3. Run aldex.senAnalysis on the clr object and values of \(\gamma\).
  4. (Optional) Use plotGamma to visualize results.

Example: Sensitivity Analyses

##Now running a sensitivity analysis over the default scale model

##First, specifying different values for the noise in the scale
gamma_to_test <- c(1e-3, .1, .25, .5, .75, 1, 2, 3, 4, 5)

##Run the CLR function
clr <- aldex.clr(Y, conds)

##Run sensitivity analysis function
sen_res <- aldex.senAnalysis(clr, gamma = gamma_to_test)

##Plotting the sensitivity results.
plotGamma(sen_res, thresh = .1)

Example: Sensitivity Analyses

Example: Sensitivity Analyses

A Real-Data Example with SELEX

Data Description

  • SELEX: Selected Growth Experiment.
  • Goal is to determine which of 1,600 gene variants confer a growth phenotype.
  • Study was conducted so that those variants that confer a growth phenotype would increase in the presence of a bacteriostatic toxin.

Data Description

  • Two conditions present: selected and non-selected.
    1. Selected: Bacteriostatic toxin allows for selected growth in some cell lines.
    2. Non-selected: Growth inhibitor leads to no growth in any of the cell lines.
  • Originally published in McMurrough et. al (2014) and discussed in Fernandes et. al (2014).

Reading in the Data

The ‘selex’ data is built-in to ‘ALDEx2’.

library(ALDEx2)
library(tidyverse)
library(ggrepel)

conds <- c(rep("NS", 7), rep("S", 7))

data(selex)
dim(selex)
## [1] 1600   14
selex[1:4,1:4]
##         X1_ANS X1_BNS X1_CNS X1_DNS
## A:D:A:D    347    271    396    317
## A:D:A:E    436    361    461    241
## A:E:A:D    476    288    378    215
## A:E:A:E    513    307    424    381

We have access to ground truth data.

##Ground truth
setwd('..')
true.pos <- read.delim(file.path("data","selex","selex-truth.txt"))$X
true.pos
##  [1] "A:E:G:E" "A:D:G:E" "A:E:G:D" "A:D:G:D" "G:E:A:E" "G:D:A:E" "G:E:A:D"
##  [8] "G:D:A:D" "A:E:A:E" "A:D:A:E" "A:E:A:D" "A:D:A:D" "G:E:G:E" "G:D:G:E"
## [15] "G:E:G:D" "G:D:G:D" "S:E:G:E" "S:D:G:E" "S:E:G:D" "S:D:G:D" "G:E:S:E"
## [22] "G:D:S:E" "G:E:S:D" "G:D:S:D" "A:E:S:E" "S:E:A:E" "S:E:S:E"

Modeling with ALDEx2 (no scale)

##ALDEx2
aldex.mod <- aldex(selex,conds)
aldex.pos <- aldex.mod %>% filter(we.eBH < 0.05)
table(row.names(aldex.pos) %in% true.pos)
## 
## FALSE  TRUE 
##    50    18

Modeling with ALDEx2 (default scale)

##Adding scale
aldex.scale <- aldex(selex, conds, gamma = .5)
aldex.scale.pos <- aldex.scale %>% filter(we.eBH < 0.05)
table(row.names(aldex.scale.pos) %in% true.pos)
## 
## FALSE  TRUE 
##    39    18

Testing over Gamma (default scale)

##Plotting TP and FP over different levels of gamma
gamma.to.test <- c(1e-3,.25,.5,.75,1,2,3, 4, 5)

TP <- rep(NA, length(gamma.to.test))
FP <- rep(NA, length(gamma.to.test))

for(i in 1:length(gamma.to.test)){
  mod <- aldex(selex,conds,gamma = gamma.to.test[i])
  mod.pos <- row.names(mod %>% filter(we.eBH < 0.05))

  TP[i] <- sum(mod.pos %in% true.pos)
  FP[i] <- sum(!(mod.pos %in% true.pos))
}

Testing over Gamma (default scale)

graph.df <- data.frame("Gamma" = rep(gamma.to.test, 2),
                       "Counts" = c(TP, FP),
                       "Group" = c(rep("TP", length(gamma.to.test)),
                                   rep("FP", length(gamma.to.test))))

ggplot(graph.df, aes(x = Gamma, y = Counts,
                     group = Group, color = Group)) +
  geom_line(aes(linetype = Group), lwd = 1.25) +
  scale_color_manual(values = c("red3", "royalblue3")) +
  scale_linetype_manual(values=c("twodash", "longdash")) +
  theme_bw() +
  ylab("Counts")

Testing over Gamma (default scale)

Built-In Sensitivity Analyses

##Built-in sensitivity analysis
clr <- aldex.clr(selex, conds)

##Run sensitivity analysis function
sen_res <- aldex.senAnalysis(clr, gamma = gamma.to.test)

##Plotting the sensitivity results.
plotGamma(sen_res, thresh = .1)

Built-In Sensitivity Analyses

References

  • Nixon, et. al. (2023) “Scale Reliant Inference.” ArXiv.

  • Gloor, Nixon, and Silverman. (2023) “Scale is Not What You Think; Explicit Scale Simulation in ALDEx2.” BioRXiv.

  • Nixon, Gloor, and Silverman. (2023) “Beyond Normalizations: Incorporating Scale Uncertainty in ALDEx2.” Forthcoming.

  • McMurrough et. al. (2014).”Control of catalytic efficiency by a coevolving network of catalytic and noncatalytic residues.” PNAS.

  • Fernandes et. al. (2014). “Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis.” Microbiome.